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Abstract 

Many biological characteristics of evolutionary inter- 
est are not scalar variables but continuous functions. 
Here we use phylogenetic Gaussian process regres- 
sion to model the evolution of simulated function- 
valued traits. Given function-valued data only from 
the tips of an evolutionary tree and utilising inde- 
pendent principal component analysis (IPCA) as a 
method for dimension reduction, we construct distri- 
butional estimates of ancestral function- valued traits, 
and estimate parameters describing their evolution- 
ary dynamics. 

Keywords: comparative analysis; Ornstein- 
Uhlcnbeck process; non-parametric Bayesian infer- 
ence; functional phylogenetics; ancestral reoncon- 
struction 

1 Introduction 

The number, reliability and coverage of evolution- 
ary trees is growing rapidly [T]. However, knowing 
organisms' evolutionary relationships through phylo- 
genetics is only one step in understanding the evo- 
lution of their characteristics Two issues are 
particularly challenging: first, information is typi- 
cally only available for extant organisms, represented 
by tips of a phylogenetic tree, whereas understand- 
ing their evolution requires inference about ances- 
tors deeper in the tree. Second, available informa- 
tion for different organisms in a phylogeny is not 
independent: a phylogeny describes a complex pat- 
tern of non-independence. A variety of statistical 
models have been developed to address these issues 
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(e.g. see [3]). However, most deal with only one 
characteristic of an organism, encapsulated in a sin- 
gle value, at a time. This simplicity contrasts with 
the complexity of any organism which has, not only 
many individual characteristics, but also characteris- 
tics impossible to represent effectively as single num- 
bers. Some such characteristics, for example growth 
curves, may be modelled as function-valued traits [2], 
i.e. as points in an infinite dimensional space. A 
novel approach to analysing the evolution of function- 
valued traits has recently been proposed: phyloge- 
netic Gaussian process regression [5J |S] . Here we de- 
velop a practical implementation of this approach, 
which first linearly decomposes function- valued traits 
for a set of taxa into statistically independent com- 
ponents. This phylogenetically agnostic method of 
dimension reduction is robust to mixed inherited and 
taxon-specific variation in the data (e.g. see [7]). 
Our method then implements the Bayesian regres- 
sion analysis described in [5] , returning posterior dis- 
tributions for ancestral function-valued traits. We 
show that this analysis produces reliable distribu- 
tional estimates for our simulated data, which may 
be further separated into inherited and specific com- 
ponents of variation. We also analyse the statisti- 
cal performance of the method for making point es- 
timates for the (hyper-)parameters [5] describing the 
evolutionary dynamics of the components. Overall, 
our method appropriately combines developments in 
functional data analysis with the evolutionary dy- 
namics of quantitative phenotypic traits, allowing 
nonparametric Bayesian inference from phylogeneti- 
cally non-independent function-valued traits. Details 
of the simulation and inference methods are given 
in section [2] and section [3] gives results and discus- 
sion both for reconstructing ancestral function- valued 
traits and for estimating their evolutionary parame- 
ters. 
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2 Methods 

2.1 Simulating function- valued traits 

We simulate datasets using the Omstein-Uhlenbeck 
(OU) Gaussian process as a model of evolutionary 
change. The technical justification for the broad ap- 
plicability of this model is presented in |2.2| and |2.3| 
However, OU processes are already well documented 
in the evolutionary biology literature [HI HH] , having 
the advantage over simpler Brownian motion mod- 
els pj] of modelling both selection and genetic drift. 
The OU model also exhibits a stationary distribution 
with covariances between character values decreasing 
exponentially with phylogenetic distance [T2"] . 

We first generated a random, non-ultrametric, 128- 
tip phylogentic tree T, with branch lengths drawn 
from an inverse Gaussian (IG) distribution, IG(.5,.5) 
(figure [TjA.) . Function- valued traits were simulated at 
each tip and internal node by randomly mixing a 
common set of basis functions; for each i = 1, 2, 3 
a smooth function (figure IB, Upper-left) was discre- 
tised producing a basis vector <f>i of length 1024. An 
OU Gaussian Process (independent for each i) was 
used to generate weights for the ith component w % at 
each of the 255 nodes of T (128 tips and 127 internal 
nodes), with mean zero and covariance function 
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for tx and t 2 € T, where d,T(ti,t 2 ) denotes the pa- 
tristic distance (that is, the distance in T) between ti 
and t 2 . This covariance function is developed in sec- 
tion IIC2 of [5]: in the present context, <Tj quantifies 
the intensity of inherited variation, A 1 is the charac- 
teristic length-scale of the evolutionary dynamics [8] 
(equivalent to the inverse of the strength of selection), 
and a l n quantifies the intensity of specific variation 
(i.e. variation unattributable to the phylogeny). We 
selected the hyperparameters in table[T] to give differ- 
ent qualities to each of the three components of vari- 
ation. In particular, component 2 (figure [Tj3, upper- 
left, dashed line) has no inherited variation and it fol- 
lows that the characteristic length-scale/strength-of- 
selection parameter A 2 has no meaning for this com- 
ponent. 

Each node in the tree t € T thus had an associ- 
ated vector wt = (u^ , iu 2 , iuj?) giving the weights for 
each of the three basis functions. These weighted ba- 
sis functions were used to produce a single function- 



1 4.5 

2 

3 3.0 



17.9 0.45 
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Table 1: Hyperparameter values used to generate the 
function-valued data shown in figure [l]A The three 
hyperparameters (inherited variation c/, length-scale 
A and specific variation cr„, see text), define the co- 
variance of the basis functions (figure |Tj3, upper- left) 
at different points in the tree according to equation 
0. 



valued trait ft at each node (of which four are shown 
in figure [Tj3 , lower- left panel): 



ft = wj(j> 



(2) 



where cj> is the 3 x 1024 matrix having each <fii as its 
rows. The resulting set of 255 curves ft (t € T) was 
divided into two parts: the 128 curves at tips of T 
to be used as training data (that is, inputs to our 
regression analysis), and the 127 curves at internal 
nodes of T to be used for validation of the method. 



2.2 Dimension reduction and source 
separation for functional data 

Since the space of all (univariate) continuous func- 
tions is infinite-dimensional, if the sample curves 
lie "close" to a finite dimensional linear subspace, an 
acceptable approximation can be obtained by util- 
ising weighted sums of basis curves that span that 
linear subspace. Effectively, this involves reversing 
equation p]): given the curves ft, the task is to esti- 
mate the common basis <fi and the weights Wt- This 
is the challenge of 'source separation' [IB"] . 

A more rigorous justification for this heuristic is 
provided by [S]- It is shown there that a wide class 
of Gaussian processes of function-valued traits on a 
phylogeny have exactly the linear decomposition in 
equation ([2]), where for each of the i basis functions 
the weights w\ themselves form a (univariate) phylo- 
genetic Gaussian process w % on the phylogeny T. The 
task, then, is to estimate the basis <f> and weights wt 
at the tips of T accurately. The standard approach to 
choosing basis curves is PC A [T3], which does so by 
explaining the greatest possible variation in succes- 
sive orthogonal components, an approach extended to 
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Figure 1: A The 128-tip random tree used with function- valued traits at four tips (right), the root and one 
internal node (left). The maximal distance between any two tips is 17.9, the mean distance 7.43. B Panels, 
clockwise from top-left: The common basis functions used to generate data (<j>i, <p2, </>3); results of PCA on 
the training data; results of IPCA on the training data; and the simulated function- valued traits at the four 
tips indicated in A (black lines), with the sample mean curve (red). 



take account of phylogenetic relationships [TS] . How- 
ever, if a sample of functions is generated by mixing 
non-orthogonal basis functions, the principal compo- 
nents of the sample (whether or not they account 
for phylogeny) will not equal the basis curves, due 
to the assumption of orthogonality: see figure [TJ3. If 
we remove the assumption of orthogonality, however, 
we must replace it with an alternative assumption 
in order to have the right number of mathematical 
constraints to perform source separation. In inde- 
pendent components analysis (ICA), the alternative 
assumption made is that the weight components w 1 
are statistically independent for different values of i. 
This assumption fits naturally into our model. 

PCA is an appropriate tool for estimating true 
dimensionality. Therefore, to achieve both dimen- 
sion reduction and source separation, we first applied 
PCA to the training data (the 128 curves at the tips 
of T) to determine the appropriate value for k [13] , 
which was correctly returned as k — 3. The prin- 
cipal components were then passed to the 'CubICA' 
implementation of ICA 16 . ICA has proved fruit- 
ful in other biological applications [17] as has passing 
the results of PCA to ICA, which has been termed 
IPCA [THj. CubICA returned a new set of k com- 
ponents (figure |Tj3, lower-right panel) and, for each 
i, a corresponding weight w\ at each tip t. An in- 



dependent, univariate phylogenetic Gaussian process 
regression was then performed for each value of i, as 



described in 2.3 to obtain posterior distributions for 
the weights throughout the tree. The posterior dis- 
tributions at every node in the tree (one posterior 
for each w l ) were then mapped to a single functional 
posterior distribution. 

Using IPCA we can therefore approximately re- 
construct both the basis 4> and the mixing vectors 
wt from the tip data, in such as way as to be fully 
compatible with the phylogenetic Gaussian process 
framework of [5]. 

2.3 Phylogenetic Gaussian process re- 
gression 

The mechanics of phylogenetic Gaussian process re- 
gression using a basis (j) are discussed in detail in 
5j, section IIC2. As discussed there, the phyloge- 
netic OU process is the only stationary and Marko- 
vian Gaussian process. For each i = 1,2,3, under 
these assumptions we would therefore choose the co- 
variance function given in equation ([!]) , and only the 
hyperparameters 9i = (a^, a % nl A 1 ) would remain to be 
specified (for a fixed phylogeny T). 

In the ancestral inference described in |2.2| we as- 
sume the parameter values in table [l] to be known. 
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For the task of parameter estimation we performed 
the following procedure. For each of the three com- 
ponents <7y j o % n , A 1 in turn, given the values of of the 
other two components we maximise the likelihood ex- 
pression (2) in |5j with respect to the third compo- 
nent to obtain their maximum likelihood estimates, 
and compare these to the true value. 



3 Results and Discussion 

Given a tree T and functional data associated with 
each of its tips, we shall make inferences about an- 
cestral traits and evolutionary parameters. We sim- 
ulate data on the tree in figure [T]^ as described in 
2.1 and we estimate the independently varying basis 
functions 4>i using IPCA (figure[lj3) on tip data alone. 
Using the basis 4> and the process hyperparameters, 
we derive posterior distributions over functional traits 
throughout the tree. Examples of the posterior dis- 
tributions obtained are shown in figure [2] Since spe- 
cific variation (represented by the light grey band) is 
modelled as statistically independent at each point 
in T, the specific variance may simply be subtracted 
from the total posterior variance to obtain the poste- 
rior variance due to inherited variation (whose stan- 
dard deviation is given by the width of the dark grey 
band) . The simulated validation data is also shown in 
black, and can be seen typically to lie within the pos- 
terior standard deviation (given by the width of the 
dark grey plus light grey band) . Note that the contri- 
bution of specific variation to the posterior variance 
is constant across the tree since tip data are silent 
concerning the specific component of the function- 
valued trait at internal nodes. On the other hand, 
the darker grey band decreases in width going away 
from the root, reflecting the decreasing (although not 
entirely vanishing) uncertainty concerning the inher- 
ited component of the function- valued trait as we ap- 
proach the measured traits at the tips. We note that 
the posterior distribution, even at the root, puts a 
clear constraint on the possible ancestral functional 
data: in this (admittedly simulated and highly con- 
trolled) setting we can reason effectively about ances- 
tral function-valued traits. 

We next considered whether parameters describing 
evolutionary dynamics could be estimated, given only 
functional data from the tips of the tree. Specifically, 
we fixed two of the three components of 9i to ran- 
dom values and estimated the third: results shown in 
figures Sl-2. The accuracy of the predictions is very 
encouraging: essentially they are problematic only 



on relatively rare occasions when an over-simplified 
model is fitted excluding either inherited or specific 
variation entirely. Similarly promising statistical per- 
formance was obtained when A was known and knowl- 
edge of either 07 or a n was replaced by knowledge of 
the ratio between them. 

When there is greater uncertainty over the hyper- 
parameters, we anticipate that the use of hyperpriors 
will enable statistical performance to be maintained 
for inference about ancestral function-valued traits, 
but we reserve this for future work. 

R Code for the IPCA, ancestral reconstruction 
and hyperparameter estimation is available from 
http://tinyurl.com/FuncPhylol 
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Figure 2: Posterior distributions at three points in the phylogeny. The prediction made by the regression 
analysis shown via the posterior mean (red line), the component of posterior standard deviation due to 
inherited variation (dark grey band) and specific variation (light grey band). The black line shows the 
simulated data. In the left and centre panels this black line enables validation of the ancestral predictions. 
In the right panel, the black line is the training data at that tip and the posterior distribution should be 
understood as a prediction for an independent organism at the same point in the phylogeny. The root and 
internal node here are the same as those indicated in figure [T]4, and the tip is the second from bottom on 
the right of figure [T]^. 
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